0. Introduction

This script presents our group project for the BST209 course. We describe our workflow for training and testing a diabetes mellitus prediction model in ICU patients. The project workflow is as follows:

  1. Load required packages and gather data
  2. Split the data into training and testing sets
  3. Preprocess the data and remove missing values
  4. Build a random forest classifier using the training set
  5. Evaluate our model on the test set

Project importance: Why is it important to predict diabetes mellitus in ICU patients?

When patients are admitted to the ICU, providers are often unaware of their past medical history. Our group’s mission was therefore to build a model that can predict a diagnosis of diabetes mellitus at the time of ICU admission, since this would improve clinical decision making and patient care.

1. Getting started

1.1 Loading required packages

First, we loaded the R packages necessary for our project.
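The exact package list is not shown in this knitted output; based on the functions called later in the notebook (`train()`/`confusionMatrix()` from caret, `prediction()`/`performance()` from ROCR, and ggplot2 for plots), a plausible setup is:

```r
# Assumed package list, inferred from functions used later in this notebook
library(tidyverse)      # data wrangling (read_csv, select, filter)
library(caret)          # train(), createDataPartition(), confusionMatrix()
library(randomForest)   # backend for caret's method = "rf"
library(ROCR)           # prediction(), performance() for ROC curves and AUC
library(tableone)       # CreateTableOne() for "Table 1"
library(ggplot2)        # confusion-matrix heatmaps
```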

1.2 Loading the dataset

We accessed the data on PhysioNet via the WiDS Datathon project page and downloaded ‘training_v2.csv’ from the “Files” section of the project.

Next we loaded in our data and started working with it.
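As a minimal sketch (assuming ‘training_v2.csv’ sits in the working directory):

```r
# Read the WiDS Datathon file downloaded from PhysioNet
data <- read_csv("training_v2.csv")
dim(data)
```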

1.3 Subset selection

We included eight predictors in our model: age, bmi, gender, ethnicity, glucose_apache, h1_glucose_max, d1_glucose_max, and wbc_apache.
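A sketch of the subsetting step using dplyr (column names are those shown in the preview below):

```r
# Keep the eight predictors plus the diabetes_mellitus label
data <- data %>%
  select(bmi, age, gender, glucose_apache, diabetes_mellitus,
         h1_glucose_max, d1_glucose_max, ethnicity, wbc_apache)
head(data)
```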

  bmi   age   gender   glucose_apache   diabetes_mellitus   h1_glucose_max   d1_glucose_max   ethnicity   wbc_apache
22.73    68   M                   168                   1               NA              168   Caucasian         14.1
27.42    77   F                   145                   1              145              145   Caucasian         12.7
31.95    25   F                    NA                   0               NA               NA   Caucasian           NA
22.64    81   F                   185                   0               NA              185   Caucasian          8.0
   NA    19   M                    NA                   0               NA               NA   Caucasian           NA
27.56    67   M                   156                   1               NA              156   Caucasian         10.9

We also defined our outcome variable and removed the rows with an unknown outcome; the row counts before and after removal are shown below.

## [1] 91713
## [1] 90998
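The two row counts above are consistent with dropping rows whose outcome is missing. One plausible sketch, assuming `outcome_variable` is the factor version of `diabetes_mellitus`:

```r
# Define the outcome as a factor and drop rows with an unknown outcome
data$outcome_variable <- factor(data$diabetes_mellitus)
print(nrow(data))                               # rows before removal
data <- data[!is.na(data$outcome_variable), ]
print(nrow(data))                               # rows after removal
```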

2. Creating the training and testing sets

2.1 Splitting the data

We then created the training and test sets; the first rows of each are shown below.

## # A tibble: 6 × 9
##     bmi   age gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
##   <dbl> <dbl>  <dbl>          <dbl>          <dbl>          <dbl> <chr>    
## 1  22.7    68      1            168             NA            168 Caucasian
## 2  32.0    25      0             NA             NA             NA Caucasian
## 3  22.6    81      0            185             NA            185 Caucasian
## 4  57.4    59      0            197            197            197 Caucasian
## 5  NA      70      1            164             NA            129 Caucasian
## 6  NA      45      1            380            365            365 Caucasian
## # … with 2 more variables: wbc_apache <dbl>, outcome_variable <fct>
## # A tibble: 6 × 9
##     bmi   age gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
##   <dbl> <dbl>  <dbl>          <dbl>          <dbl>          <dbl> <chr>    
## 1  27.4    77      0            145            145            145 Caucasian
## 2  NA      19      1             NA             NA             NA Caucasian
## 3  27.6    67      1            156             NA            156 Caucasian
## 4  25.7    50      1            134            134            134 <NA>     
## 5  38.2    81      1            120            121            121 Caucasian
## 6  23.4    79      0            175             NA            175 Caucasian
## # … with 2 more variables: wbc_apache <dbl>, outcome_variable <fct>
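A common caret pattern for this step. The exact call is not shown in the knitted output; the 80/20 proportion below is an assumption, though it is consistent with the group sizes reported in Table 1 (training n = 72,799 of 90,998 total):

```r
set.seed(209)  # assumed seed for reproducibility
# Stratified split on the outcome (assumed 80/20)
train_index <- createDataPartition(data$outcome_variable, p = 0.8, list = FALSE)
train_set <- data[train_index, ]
test_set  <- data[-train_index, ]
head(train_set)
head(test_set)
```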

3. Summary statistics (“Table 1”)

Summary statistics of the training set, stratified by diabetes_mellitus (0 = no diabetes, 1 = diabetes):

                                         0                1        p   test
n                                    56405            16394
bmi (mean (SD))               28.41 (7.86)     31.82 (9.00)   <0.001
age (mean (SD))              61.57 (17.41)    64.62 (14.45)   <0.001
gender = 1 (%)                30270 (53.7)      9044 (55.2)    0.001
glucose_apache (mean (SD))  141.36 (73.15)  218.80 (112.45)   <0.001
h1_glucose_max (mean (SD))  147.92 (73.63)  214.95 (118.80)   <0.001
d1_glucose_max (mean (SD))  154.69 (68.89)  239.15 (104.81)   <0.001
ethnicity (%)                                                 <0.001
   African American            5620 (10.1)      1977 (12.2)
   Asian                        647 ( 1.2)       234 ( 1.4)
   Caucasian                  43997 (79.3)     12106 (74.6)
   Hispanic                    2279 ( 4.1)       748 ( 4.6)
   Native American              419 ( 0.8)       218 ( 1.3)
   Other/Unknown               2509 ( 4.5)       953 ( 5.9)
wbc_apache (mean (SD))        12.11 (6.97)      12.19 (6.72)    0.243
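The layout above, with means (SD), group percentages, and p/test columns, matches the output of the tableone package; a sketch:

```r
# Table 1: training-set summary statistics stratified by the outcome
vars <- c("bmi", "age", "gender", "glucose_apache",
          "h1_glucose_max", "d1_glucose_max", "ethnicity", "wbc_apache")
tab1 <- CreateTableOne(vars = vars,
                       strata = "outcome_variable",
                       data = train_set)
print(tab1)
```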

4. Preprocessing the data

4.1 Imputing missing values

To deal with the NA values in our data, we decided to replace each missing value with the median of that variable computed on the training set.

Note that it is important that we do this after splitting into training and test sets. If we imputed values using test data, we would risk leaking information about our test set into our training set (“data leakage”).
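A sketch of this step under the assumptions above: medians are computed on the training set only and then applied to both sets.

```r
# Impute each numeric predictor with its training-set median,
# applying the same values to the test set (no test-set leakage)
num_vars <- c("bmi", "age", "glucose_apache",
              "h1_glucose_max", "d1_glucose_max", "wbc_apache")
for (v in num_vars) {
  med <- median(train_set[[v]], na.rm = TRUE)   # computed on train only
  train_set[[v]][is.na(train_set[[v]])] <- med
  test_set[[v]][is.na(test_set[[v]])]   <- med
}
```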

4.2 Normalization

Since tree-based models do not require normalization, we skipped this step.

5. Model building

5.1 Training a random forest classifier
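The training call itself is not visible in this knitted output. Since later chunks call `predict(forest, type = "raw")` and `predict(forest, type = "prob")`, a caret fit of roughly this shape is assumed:

```r
# Assumed caret call; predict(..., type = "raw"/"prob") in later
# chunks implies a caret train() object named `forest`
forest <- train(outcome_variable ~ .,
                data = train_set,
                method = "rf")
```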

5.2 Viewing variable importance

Overall
age 47.0295952
bmi 67.4601747
gender 5.8278661
glucose_apache 80.2409013
h1_glucose_max 44.6706070
d1_glucose_max 100.0000000
ethnicityAsian 0.1917897
ethnicityCaucasian 4.1155818
ethnicityHispanic 1.2941783
ethnicityNative American 0.0000000
ethnicityOther/Unknown 1.4721757
wbc_apache 50.3555599

We can see that d1_glucose_max and glucose_apache had the most predictive power, followed by bmi.
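The 0–100 scaling of the table above (with d1_glucose_max at 100) matches caret's scaled variable importance:

```r
# Scaled variable importance (most important variable = 100)
varImp(forest)
```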

5.3 Predicting diabetes mellitus in “unseen patients”

Selecting patients from the test set

  • Let’s look at some patients in our test set
## # A tibble: 5 × 8
##     age   bmi gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
##   <dbl> <dbl>  <dbl>          <dbl>          <dbl>          <dbl> <chr>    
## 1    77  27.4      0            145            145            145 Caucasian
## 2    19  27.7      1            133            140            150 Caucasian
## 3    67  27.6      1            156            140            156 Caucasian
## 4    50  25.7      1            134            134            134 Caucasian
## 5    81  38.2      1            120            121            121 Caucasian
## # … with 1 more variable: wbc_apache <dbl>

Predicting diabetes mellitus in unseen patients

Original results:

age bmi gender glucose_apache h1_glucose_max d1_glucose_max ethnicity wbc_apache prediction truth_value
77 27.42000 0 145.0 145.000 145 Caucasian 12.7 0 1
19 27.65432 1 133.0 140.000 150 Caucasian 10.4 0 0
67 27.56000 1 156.0 140.000 156 Caucasian 10.9 0 1
50 25.71000 1 134.0 134.000 134 Caucasian 8.4 0 0
81 38.18907 1 120.0 121.000 121 Caucasian 6.9 0 0
79 23.40898 0 175.0 140.000 175 Caucasian 13.5 0 0
65 27.65432 0 205.0 140.000 208 Caucasian 10.1 0 1
60 26.48571 1 149.0 140.000 149 African American 7.5 0 0
71 34.30569 0 157.0 140.000 157 Caucasian 5.4 0 0
58 27.65432 0 88.0 140.000 88 Caucasian 6.7 0 0
85 21.88964 0 81.0 140.000 81 Caucasian 6.0 0 0
65 26.28642 1 361.0 140.000 361 African American 10.4 1 0
82 23.73812 1 121.0 140.000 121 Caucasian 8.5 0 0
64 16.98039 1 96.0 140.000 113 Caucasian 21.8 0 0
59 14.84493 0 71.0 140.000 182 African American 1.8 0 0
55 32.20536 1 81.0 95.000 95 Caucasian 8.2 0 0
62 27.26740 1 72.0 87.000 87 Caucasian 3.5 0 0
56 34.15440 0 133.0 140.000 150 Caucasian 10.4 0 0
73 20.35197 0 159.0 140.000 159 Caucasian 15.6 0 0
48 26.51648 1 96.0 140.000 96 Caucasian 5.6 0 0
65 33.00135 0 94.0 140.000 94 Caucasian 10.4 0 0
55 24.12382 1 114.0 140.000 114 Caucasian 4.1 0 0
71 35.07812 0 181.0 140.000 181 Caucasian 12.1 0 1
47 28.86719 0 76.0 82.000 121 Asian 4.7 0 0
75 27.65432 0 598.7 695.045 611 Caucasian 11.4 1 1
55 23.74531 0 160.0 192.000 192 African American 16.2 0 0
74 27.65432 0 324.0 140.000 324 Caucasian 18.0 1 1
68 18.45329 0 260.0 291.000 291 Caucasian 27.6 0 0
89 27.25875 1 106.0 140.000 106 Caucasian 15.4 0 0
69 34.75311 0 133.0 140.000 145 Caucasian 10.4 0 0

## [1] "auc:"
## [1] 0.8130455
## [1] "accuracy:"
## [1] 0.8068026
## [1] "recall:"
## [1] 0.612442
## [1] "specificity:"
## [1] 0.8390238
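These overall test-set metrics can be reproduced with the same ROCR/caret tools used in the subgroup loops below; a sketch:

```r
# Class predictions and positive-class probabilities on the full test set
forest_pred  <- predict(forest, newdata = test_set, type = "raw")
forest_probs <- predict(forest, newdata = test_set, type = "prob")[, 2]

# AUC via ROCR
forest_roc <- prediction(forest_probs, test_set$outcome_variable)
auc <- performance(forest_roc, measure = "auc")@y.values[[1]]
print("auc:"); print(auc)

# Accuracy, recall (sensitivity), and specificity via caret
cm <- confusionMatrix(forest_pred, test_set$outcome_variable, positive = "1")
print(cm$overall["Accuracy"])
print(cm$byClass[c("Sensitivity", "Specificity")])
```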

We created a data frame to save the results of the subgroup analyses.
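Since later chunks append rows with `rbind()`, `df_result` is presumably initialized as an empty data frame, e.g.:

```r
# Container for subgroup metrics; rows are appended with rbind() in the loops
df_result <- data.frame(name = character(), auc = numeric(),
                        accuracy = numeric(), recall = numeric(),
                        specificity = numeric(), range = character())
```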

Subgroup analysis (discrete variables):

library(ggplot2)

var_list <- c("gender", "ethnicity")

for (var in var_list) {
  print(var)

  # Loop over each level of the discrete variable
  values <- unique(original_test_set[[var]])

  for (i in values) {
    test_set <- original_test_set[original_test_set[[var]] == i, ]
    num <- nrow(test_set)

    # Make class predictions on this subgroup of the test set
    forest_pred <- predict(forest,
                           newdata = test_set,
                           type = "raw")

    # Get the probabilities for the positive class
    forest_probs <- predict(forest, newdata = test_set, type = "prob")[, 2]

    # Create a "prediction" object using our probabilities and the outcome
    # variable, then plot the ROC curve
    forest_roc <- prediction(forest_probs, test_set$outcome_variable)
    forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
    plot(forest_perf, col = rainbow(10), main = paste(var, ":", i, "(", num, ")"))
    abline(a = 0, b = 1)

    # Calculate the AUC
    auc_forest <- performance(forest_roc, measure = "auc")
    auc_forest <- auc_forest@y.values[[1]]

    # Ground truth is recorded in the GOSSIS data; in caret's confusion
    # matrix, rows are predictions and columns are the reference
    cm <- confusionMatrix(forest_pred,
                          test_set$outcome_variable,
                          positive = "1")$table

    # Plot the confusion matrix as a heatmap
    Reference  <- factor(c(0, 0, 1, 1))
    Prediction <- factor(c(0, 1, 0, 1))
    Y <- c(cm[1], cm[2], cm[3], cm[4])
    df_2 <- data.frame(Reference, Prediction, Y)

    print(ggplot(df_2, aes(x = Reference, y = Prediction)) +
      ggtitle(paste(var, ":", i, "(", num, ")")) +
      geom_tile(aes(fill = Y), colour = "white") +
      geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold") +
      scale_fill_gradient(low = "white", high = "#009194") +
      theme_bw() + theme(legend.position = "none") + coord_fixed())

    # cm[1] = TN, cm[2] = FP, cm[3] = FN, cm[4] = TP
    accuracy    <- (cm[1] + cm[4]) / sum(cm)
    recall      <- cm[4] / (cm[4] + cm[3])
    specificity <- cm[1] / (cm[1] + cm[2])

    df_result <- rbind(df_result,
                       data.frame(name = var, auc = auc_forest,
                                  accuracy = accuracy, recall = recall,
                                  specificity = specificity, range = i))

    print(paste(i, "(", num, "):"))
    print("auc:")
    print(auc_forest)
    print("accuracy:")
    print(accuracy)
    print("recall:")
    print(recall)
    print("specificity:")
    print(specificity)
  }
}
## [1] "gender"

## [1] "0 ( 8414 ):"
## [1] "auc:"
## [1] 0.8209621
## [1] "accuracy:"
## [1] 0.8168529
## [1] "recall:"
## [1] 0.6338148
## [1] "specificity:"
## [1] 0.8466215

## [1] "1 ( 9785 ):"
## [1] "auc:"
## [1] 0.8059802
## [1] "accuracy:"
## [1] 0.7985692
## [1] "recall:"
## [1] 0.5961675
## [1] "specificity:"
## [1] 0.832617
## [1] "ethnicity"

## [1] "Caucasian ( 14328 ):"
## [1] "auc:"
## [1] 0.8179748
## [1] "accuracy:"
## [1] 0.8161642
## [1] "recall:"
## [1] 0.6157135
## [1] "specificity:"
## [1] 0.8462712

## [1] "African American ( 1864 ):"
## [1] "auc:"
## [1] 0.777936
## [1] "accuracy:"
## [1] 0.7757511
## [1] "recall:"
## [1] 0.5917722
## [1] "specificity:"
## [1] 0.8133075

## [1] "Asian ( 245 ):"
## [1] "auc:"
## [1] 0.7922566
## [1] "accuracy:"
## [1] 0.7632653
## [1] "recall:"
## [1] 0.5833333
## [1] "specificity:"
## [1] 0.7942584

## [1] "Hispanic ( 735 ):"
## [1] "auc:"
## [1] 0.814394
## [1] "accuracy:"
## [1] 0.7768707
## [1] "recall:"
## [1] 0.5986395
## [1] "specificity:"
## [1] 0.8214286

## [1] "Native American ( 145 ):"
## [1] "auc:"
## [1] 0.8129614
## [1] "accuracy:"
## [1] 0.737931
## [1] "recall:"
## [1] 0.6285714
## [1] "specificity:"
## [1] 0.7727273

## [1] "Other/Unknown ( 882 ):"
## [1] "auc:"
## [1] 0.8015829
## [1] "accuracy:"
## [1] 0.7721088
## [1] "recall:"
## [1] 0.6373626
## [1] "specificity:"
## [1] 0.8071429

Subgroup analysis (continuous variables):

var_list <- c("age", "bmi", "glucose_apache", "h1_glucose_max",
              "d1_glucose_max", "wbc_apache")

for (var in var_list) {
  print(var)

  # Cut-points for this variable (defined earlier in the notebook)
  var_range <- df[df$name == var, "range"]

  for (i in 1:(length(var_range) + 1)) {

    # Select the subgroup for this bin; note that values exactly equal to
    # a cut-point fall into no bin
    if (i == 1) {
      test_set <- original_test_set[original_test_set[[var]] < var_range[i], ]
    } else if (i == length(var_range) + 1) {
      test_set <- original_test_set[original_test_set[[var]] > var_range[i - 1], ]
    } else {
      test_set <- original_test_set[original_test_set[[var]] > var_range[i - 1] &
                                    original_test_set[[var]] < var_range[i], ]
    }

    num <- nrow(test_set)

    # Label for this bin, e.g. "[ 2 ] 20 ~ 40 ( 1653 )"
    if (i == 1) {
      range_text <- paste("[", i, "]", "<", var_range[i], "(", num, ")")
    } else if (i == length(var_range) + 1) {
      range_text <- paste("[", i, "]", ">", var_range[i - 1], "(", num, ")")
    } else {
      range_text <- paste("[", i, "]", var_range[i - 1], "~", var_range[i],
                          "(", num, ")")
    }

    # Make class predictions on this subgroup of the test set
    forest_pred <- predict(forest,
                           newdata = test_set,
                           type = "raw")

    # Get the probabilities for the positive class
    forest_probs <- predict(forest, newdata = test_set, type = "prob")[, 2]

    # Create a "prediction" object using our probabilities and the outcome
    # variable, then plot the ROC curve
    forest_roc <- prediction(forest_probs, test_set$outcome_variable)
    forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
    plot(forest_perf, col = rainbow(10), main = paste(var, ":", range_text))
    abline(a = 0, b = 1)

    # Calculate the AUC
    auc_forest <- performance(forest_roc, measure = "auc")
    auc_forest <- auc_forest@y.values[[1]]

    print(paste(range_text, ":"))

    # Ground truth is recorded in the GOSSIS data; in caret's confusion
    # matrix, rows are predictions and columns are the reference
    cm <- confusionMatrix(forest_pred,
                          test_set$outcome_variable,
                          positive = "1")$table

    # Plot the confusion matrix as a heatmap
    Reference  <- factor(c(0, 0, 1, 1))
    Prediction <- factor(c(0, 1, 0, 1))
    Y <- c(cm[1], cm[2], cm[3], cm[4])
    df_2 <- data.frame(Reference, Prediction, Y)

    print(ggplot(df_2, aes(x = Reference, y = Prediction)) +
      ggtitle(range_text) +
      geom_tile(aes(fill = Y), colour = "white") +
      geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold") +
      scale_fill_gradient(low = "white", high = "#009194") +
      theme_bw() + theme(legend.position = "none") + coord_fixed())

    # cm[1] = TN, cm[2] = FP, cm[3] = FN, cm[4] = TP
    accuracy    <- (cm[1] + cm[4]) / sum(cm)
    recall      <- cm[4] / (cm[4] + cm[3])
    specificity <- cm[1] / (cm[1] + cm[2])

    df_result <- rbind(df_result,
                       data.frame(name = var, auc = auc_forest,
                                  accuracy = accuracy, recall = recall,
                                  specificity = specificity,
                                  range = range_text))

    print("auc:")
    print(auc_forest)
    print("accuracy:")
    print(accuracy)
    print("recall:")
    print(recall)
    print("specificity:")
    print(specificity)
  }
}
## [1] "age"

## [1] "[ 1 ] < 20 ( 160 ) :"

## [1] "auc:"
## [1] 0.9911937
## [1] "accuracy:"
## [1] 0.9625
## [1] "recall:"
## [1] 0.7
## [1] "specificity:"
## [1] 1

## [1] "[ 2 ] 20 ~ 40 ( 1653 ) :"

## [1] "auc:"
## [1] 0.9303992
## [1] "accuracy:"
## [1] 0.9122807
## [1] "recall:"
## [1] 0.6790123
## [1] "specificity:"
## [1] 0.9376258

## [1] "[ 3 ] 40 ~ 60 ( 4691 ) :"

## [1] "auc:"
## [1] 0.8298128
## [1] "accuracy:"
## [1] 0.8166702
## [1] "recall:"
## [1] 0.5799419
## [1] "specificity:"
## [1] 0.857357

## [1] "[ 4 ] 60 ~ 80 ( 8360 ) :"

## [1] "auc:"
## [1] 0.7820416
## [1] "accuracy:"
## [1] 0.7767943
## [1] "recall:"
## [1] 0.6246334
## [1] "specificity:"
## [1] 0.8064608

## [1] "[ 5 ] > 80 ( 2420 ) :"

## [1] "auc:"
## [1] 0.773025
## [1] "accuracy:"
## [1] 0.8144628
## [1] "recall:"
## [1] 0.6103896
## [1] "specificity:"
## [1] 0.8359982
## [1] "bmi"

## [1] "[ 1 ] < 20 ( 1426 ) :"

## [1] "auc:"
## [1] 0.834415
## [1] "accuracy:"
## [1] 0.8870968
## [1] "recall:"
## [1] 0.5042735
## [1] "specificity:"
## [1] 0.921314

## [1] "[ 2 ] 20 ~ 30 ( 10204 ) :"

## [1] "auc:"
## [1] 0.815502
## [1] "accuracy:"
## [1] 0.8360447
## [1] "recall:"
## [1] 0.6160991
## [1] "specificity:"
## [1] 0.8591229

## [1] "[ 3 ] 30 ~ 40 ( 4930 ) :"

## [1] "auc:"
## [1] 0.7814041
## [1] "accuracy:"
## [1] 0.7561866
## [1] "recall:"
## [1] 0.6107249
## [1] "specificity:"
## [1] 0.7935254

## [1] "[ 4 ] > 40 ( 1636 ) :"

## [1] "auc:"
## [1] 0.7548043
## [1] "accuracy:"
## [1] 0.7114914
## [1] "recall:"
## [1] 0.6424242
## [1] "specificity:"
## [1] 0.7414549
## [1] "glucose_apache"

## [1] "[ 1 ] < 100 ( 4572 ) :"

## [1] "auc:"
## [1] 0.7937578
## [1] "accuracy:"
## [1] 0.861986
## [1] "recall:"
## [1] 0.6477273
## [1] "specificity:"
## [1] 0.8661909

## [1] "[ 2 ] 100 ~ 200 ( 9650 ) :"

## [1] "auc:"
## [1] 0.753888
## [1] "accuracy:"
## [1] 0.8596891
## [1] "recall:"
## [1] 0.5737705
## [1] "specificity:"
## [1] 0.8671061

## [1] "[ 3 ] 200 ~ 300 ( 2503 ) :"

## [1] "auc:"
## [1] 0.6470318
## [1] "accuracy:"
## [1] 0.6048742
## [1] "recall:"
## [1] 0.619211
## [1] "specificity:"
## [1] 0.592371

## [1] "[ 4 ] > 300 ( 1269 ) :"

## [1] "auc:"
## [1] 0.5876709
## [1] "accuracy:"
## [1] 0.6052009
## [1] "recall:"
## [1] 0.6152399
## [1] "specificity:"
## [1] 0.5533981
## [1] "h1_glucose_max"

## [1] "[ 1 ] < 100 ( 1193 ) :"

## [1] "auc:"
## [1] 0.8047401
## [1] "accuracy:"
## [1] 0.8365465
## [1] "recall:"
## [1] 0.6027397
## [1] "specificity:"
## [1] 0.8517857

## [1] "[ 2 ] 100 ~ 200 ( 15255 ) :"

## [1] "auc:"
## [1] 0.7951695
## [1] "accuracy:"
## [1] 0.8254998
## [1] "recall:"
## [1] 0.582517
## [1] "specificity:"
## [1] 0.8486502

## [1] "[ 3 ] 200 ~ 300 ( 1029 ) :"

## [1] "auc:"
## [1] 0.626052
## [1] "accuracy:"
## [1] 0.5986395
## [1] "recall:"
## [1] 0.6352941
## [1] "specificity:"
## [1] 0.5483871

## [1] "[ 4 ] > 300 ( 636 ) :"

## [1] "auc:"
## [1] 0.5722972
## [1] "accuracy:"
## [1] 0.6509434
## [1] "recall:"
## [1] 0.6672504
## [1] "specificity:"
## [1] 0.5076923
## [1] "d1_glucose_max"

## [1] "[ 1 ] < 100 ( 1889 ) :"

## [1] "auc:"
## [1] 0.7760518
## [1] "accuracy:"
## [1] 0.9534145
## [1] "recall:"
## [1] 1
## [1] "specificity:"
## [1] 0.9533898

## [1] "[ 2 ] 100 ~ 200 ( 11849 ) :"

## [1] "auc:"
## [1] 0.7331695
## [1] "accuracy:"
## [1] 0.8575407
## [1] "recall:"
## [1] 0.5461538
## [1] "specificity:"
## [1] 0.860995

## [1] "[ 3 ] 200 ~ 300 ( 2830 ) :"

## [1] "auc:"
## [1] 0.6437554
## [1] "accuracy:"
## [1] 0.5968198
## [1] "recall:"
## [1] 0.609736
## [1] "specificity:"
## [1] 0.5871446

## [1] "[ 4 ] > 300 ( 1456 ) :"

## [1] "auc:"
## [1] 0.5759749
## [1] "accuracy:"
## [1] 0.6112637
## [1] "recall:"
## [1] 0.625817
## [1] "specificity:"
## [1] 0.5344828
## [1] "wbc_apache"

## [1] "[ 1 ] < 5 ( 994 ) :"

## [1] "auc:"
## [1] 0.7947097
## [1] "accuracy:"
## [1] 0.8138833
## [1] "recall:"
## [1] 0.6168224
## [1] "specificity:"
## [1] 0.837655

## [1] "[ 2 ] 5 ~ 10 ( 5506 ) :"

## [1] "auc:"
## [1] 0.8156433
## [1] "accuracy:"
## [1] 0.806393
## [1] "recall:"
## [1] 0.6215236
## [1] "specificity:"
## [1] 0.8390682

## [1] "[ 3 ] 10 ~ 15 ( 7902 ) :"

## [1] "auc:"
## [1] 0.8221807
## [1] "accuracy:"
## [1] 0.8171349
## [1] "recall:"
## [1] 0.6213848
## [1] "specificity:"
## [1] 0.8501701

## [1] "[ 4 ] 15 ~ 20 ( 1960 ) :"

## [1] "auc:"
## [1] 0.792613
## [1] "accuracy:"
## [1] 0.7897959
## [1] "recall:"
## [1] 0.57
## [1] "specificity:"
## [1] 0.8295181

## [1] "[ 5 ] > 20 ( 1602 ) :"

## [1] "auc:"
## [1] 0.794275
## [1] "accuracy:"
## [1] 0.7734082
## [1] "recall:"
## [1] 0.5804598
## [1] "specificity:"
## [1] 0.7969188

## [1] "Gap:"
## [1] 0.01498192

## [1] "Gap:"
## [1] 0.04003881

## [1] "Gap:"
## [1] 0.2181687

## [1] "Gap:"
## [1] 0.07961078

## [1] "Gap:"
## [1] 0.2060869

## [1] "Gap:"
## [1] 0.2324429

## [1] "Gap:"
## [1] 0.2000769

## [1] "Gap:"
## [1] 0.02956778

## [1] "Gap:"
## [1] 0.01828363

## [1] "Gap:"
## [1] 0.07823312

## [1] "Gap:"
## [1] 0.1857057

## [1] "Gap:"
## [1] 0.1756053

## [1] "Gap:"
## [1] 0.2571119

## [1] "Gap:"
## [1] 0.2379071

## [1] "Gap:"
## [1] 0.3565947

## [1] "Gap:"
## [1] 0.04372666

## [1] "Gap:"
## [1] 0.03764729

## [1] "Gap:"
## [1] 0.0540293

## [1] "Gap:"
## [1] 0.1200581

## [1] "Gap:"
## [1] 0.1381507

## [1] "Gap:"
## [1] 0.07395678

## [1] "Gap:"
## [1] 0.08473348

## [1] "Gap:"
## [1] 0.4538462

## [1] "Gap:"
## [1] 0.05152358

## [1] "Gap:"
## [1] 0.01400453

## [1] "Gap:"
## [1] 0.0735439

## [1] "Gap:"
## [1] 0.1935392

## [1] "Gap:"
## [1] 0.1798591

## [1] "Gap:"
## [1] 0.313708

## [1] "Gap:"
## [1] 0.3440934

## [1] "Gap:"
## [1] 0.4189071

## [1] "Gap:"
## [1] 0.05325133

Additional steps. In addition to adding predictors and evaluating performance in patient subgroups, we also built other model types; please see the attached separate R Markdown notebooks for details.